Implement multithreading in qgemm_kleidi #26301

melkap01-Arm · 2025-10-14T14:33:35Z

Key changes

This PR improves the performance of dynamic QGEMMs by implementing tiling and threading across operations.

The changes introduce thread-local buffers that reuse memory during inference, and use them in dynamic quantised MatMul operations with KleidiAI kernels.

The PR also updates KleidiAI to version 1.15.0.
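As a rough illustration of the thread-local buffer reuse mentioned above, here is a minimal sketch. It is not the code in this PR: `GetThreadLocalScratch` is a hypothetical helper name, and the real implementation may size and manage its buffers differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): returns a per-thread scratch buffer
// of at least `bytes` bytes. The buffer lives for the lifetime of the thread
// and is reused across calls, so steady-state inference avoids reallocating.
inline std::byte* GetThreadLocalScratch(std::size_t bytes) {
    thread_local std::vector<std::byte> buffer;
    if (buffer.size() < bytes) {
        buffer.resize(bytes);  // grows only when a larger request arrives
    }
    return buffer.data();
}
```

Because each thread owns its buffer, no locking is needed, at the cost of holding the largest requested size per thread until that thread exits.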

Example performance

single thread :

2 threads :

melkap01-Arm · 2025-10-14T14:55:37Z

@microsoft-github-policy-service agree company="Arm"

hariharans29 · 2025-10-14T16:47:50Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-14T16:48:09Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-16T17:09:19Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-16T17:09:37Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-10-24T20:10:26Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-24T20:10:45Z

Azure Pipelines successfully started running 4 pipeline(s).

cmake/deps.txt

patryk-kaiser-ARM · 2025-10-28T10:48:42Z

Can we get the workflows run, please?

hariharans29 · 2025-10-28T16:19:52Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-10-28T16:20:12Z

Azure Pipelines successfully started running 4 pipeline(s).

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

hariharans29 · 2025-10-28T20:07:18Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

hariharans29 · 2025-10-30T21:11:42Z

Will trigger CI once you push commits addressing the PR feedback (right now I only see a rebase). Thanks.

melkap01-Arm · 2025-10-31T17:25:30Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In the current implementation, the tests only run with thread pool = null. We have created a follow-up ticket for test coverage.

hariharans29 · 2025-10-31T17:48:04Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In the current implementation, the tests only run with thread pool = null. We have created a follow-up ticket for test coverage.

If all the tests are with ThreadPool == null, does that mean the new threadpool-based parallel code path(s) are not exercised?

melkap01-Arm · 2025-11-04T10:46:55Z

General sanity check question: Are there enough tests that trigger all the nuances of the multi-threaded implementation - Are there enough tests with multiple batch sizes, M, and N dimensions that exercise all aspects of the multi-threaded implementation ?

We checked the existing tests for qgemm. In the current implementation, the tests only run with thread pool = null. We have created a follow-up ticket for test coverage.

If all the tests are with ThreadPool == null, does that mean the new threadpool-based parallel code path(s) are not exercised?

It means the parallel path was not exercised in the onnxruntime_mlas_test run, but it is exercised by onnxruntime_perf_test. Unit tests for the multithreaded code have now been added in the latest commit, so both can now use multiple threads.
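To illustrate the kind of check being discussed, here is a minimal sketch that compares a single-threaded run against a multi-threaded run of the same integer matmul. It uses plain `std::thread` rather than the MLAS thread pool, and `MatMulRows` and `MatMul` are hypothetical names, not functions from the PR or its tests.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Computes rows [row_begin, row_end) of C = A * B for row-major int8 inputs.
inline void MatMulRows(const std::vector<std::int8_t>& A, const std::vector<std::int8_t>& B,
                       std::vector<std::int32_t>& C, std::size_t K, std::size_t N,
                       std::size_t row_begin, std::size_t row_end) {
    for (std::size_t m = row_begin; m < row_end; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            std::int32_t acc = 0;  // stays 0 when K == 0
            for (std::size_t k = 0; k < K; ++k) {
                acc += static_cast<std::int32_t>(A[m * K + k]) * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}

// num_threads <= 1 stands in for the "thread pool == null" path (an assumption
// of this sketch); larger values split the rows of C across worker threads.
inline std::vector<std::int32_t> MatMul(const std::vector<std::int8_t>& A,
                                        const std::vector<std::int8_t>& B,
                                        std::size_t M, std::size_t K, std::size_t N,
                                        std::size_t num_threads) {
    std::vector<std::int32_t> C(M * N, 0);
    if (num_threads <= 1) {
        MatMulRows(A, B, C, K, N, 0, M);
        return C;
    }
    const std::size_t rows_per_thread = (M + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < M; begin += rows_per_thread) {
        const std::size_t end = std::min(M, begin + rows_per_thread);
        // Each worker writes a disjoint row range of C, so no locking is needed.
        workers.emplace_back(MatMulRows, std::cref(A), std::cref(B), std::ref(C),
                             K, N, begin, end);
    }
    for (auto& w : workers) {
        w.join();
    }
    return C;
}
```

A test along these lines would assert that `MatMul(A, B, M, K, N, 1)` and `MatMul(A, B, M, K, N, 3)` return identical results for several shapes, including ones where M is not a multiple of the thread count.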

Signed-off-by: melkap01 <[email protected]>

unused variable removed, unnecessary temp_tile use and copy removed, K==0 case checked

Signed-off-by: melkap01 <[email protected]>

Signed-off-by: melkap01 <[email protected]>

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

onnxruntime/test/contrib_ops/matmul_integer_to_float_test.cc

Signed-off-by: Jonathan Clohessy <[email protected]>

hariharans29 · 2026-01-15T15:53:47Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-15T15:54:10Z

Azure Pipelines successfully started running 4 pipeline(s).

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

Signed-off-by: Jonathan Clohessy <[email protected]>

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

Copilot

Pull request overview

This PR implements multithreading and tiling for Dynamic Quantized GEMM operations using KleidiAI kernels to improve performance on ARM64 SME/SME2 architectures. The changes introduce thread-local buffers for memory reuse during inference and update KleidiAI to version 1.15.0.

Changes:

Refactored dynamic quantization matrix multiplication to use thread-local buffers and parallel tiling across batch, M, and N dimensions
Moved KleidiAI packing logic from operator-specific code to a reusable base class
Extended test coverage to include single-threaded and multi-threaded test suites with edge cases
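One way to picture tiling across batch, M, and N is to flatten all tiles into a single work index that a thread pool can iterate over. The sketch below shows that mapping under stated assumptions: the `Tile` struct, `TileCount`, and `TileFromIndex` are hypothetical names, and the tile sizes are arbitrary parameters rather than the values the PR derives from the kernel.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical tile descriptor; field names are illustrative, not from the PR.
struct Tile {
    std::size_t batch;
    std::size_t m_start, m_count;
    std::size_t n_start, n_count;
};

inline std::size_t CeilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Total number of independent work items for a batched M x N output.
// Assumes tile_m and tile_n are non-zero.
inline std::size_t TileCount(std::size_t batch, std::size_t M, std::size_t N,
                             std::size_t tile_m, std::size_t tile_n) {
    return batch * CeilDiv(M, tile_m) * CeilDiv(N, tile_n);
}

// Maps a flat work index to its (batch, M-range, N-range) tile.
// Edge tiles are clipped so they never run past M or N.
inline Tile TileFromIndex(std::size_t index, std::size_t M, std::size_t N,
                          std::size_t tile_m, std::size_t tile_n) {
    const std::size_t tiles_n = CeilDiv(N, tile_n);
    const std::size_t per_batch = CeilDiv(M, tile_m) * tiles_n;
    const std::size_t in_batch = index % per_batch;
    const std::size_t m_start = (in_batch / tiles_n) * tile_m;
    const std::size_t n_start = (in_batch % tiles_n) * tile_n;
    return Tile{index / per_batch,
                m_start, std::min(tile_m, M - m_start),
                n_start, std::min(tile_n, N - n_start)};
}
```

Each worker can then take indices in `[0, TileCount(...))` and process its tile independently, since tiles cover disjoint regions of the output.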

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

File	Description
onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp	Splits tests into single-thread and thread-pool variants, adds proper quantization simulation and edge case handling
onnxruntime/test/contrib_ops/dynamic_quantize_matmul_test.cc	Adds KleidiAI-specific tests for bias handling, zero-point validation, and fallback scenarios
onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h	Extracts KleidiAI prepacking logic into reusable helper methods in the base class
onnxruntime/core/mlas/lib/qgemm.cpp	Updates availability check to include both SME and SME2
onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp	Implements multi-threaded tiling with thread-local buffers and adds input validation
onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h	Adds UseSME flag alongside existing UseSME2
onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc	Simplifies by delegating prepacking to base class and removes duplicate code

onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc

onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp

hariharans29 · 2026-01-16T02:52:53Z

Please rebase with main and the CUDA / TensorRT issues should go away

hariharans29 · 2026-01-16T05:55:31Z

May have some conflicts with #26849

Signed-off-by: Jonathan Clohessy <[email protected]>

hariharans29 · 2026-01-16T16:10:38Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-01-16T16:11:00Z

Azure Pipelines successfully started running 4 pipeline(s).

patryk-kaiser-ARM mentioned this pull request Oct 24, 2025

Fix: Disable KleidiAI on systems with SME1 but not SME2 #26399

Closed

hariharans29 reviewed Oct 24, 2025

cmake/deps.txt (outdated, resolved)

hariharans29 reviewed Oct 28, 2025

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp (resolved)

hariharans29 reviewed Oct 28, 2025

onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp (outdated, resolved)